Data Visualization: ggplot2 tutorial using gapminder dataset
If I can’t picture it, I can’t understand it. Albert Einstein
Overview
In this tutorial we look at some of the data on the wealth and life expectancy of countries over time used by Hans Rosling, known as gapminder. The goal is to give an overview of how to graph a variable depending on its type, introduce some simple 1D and 2D plots built with ggplot2, and outline the layered grammar of graphics upon which ggplot2 is built.
Learning objectives
- Generate plots from data according to their type (discrete, continuous …)
- Manage plot settings
- Produce plots from data in a data frame
- Modify and customize a plot
- Create complex and fancy plots
Loading/installing packages
library(ggplot2)
library(dplyr)
library(scales)
library(gapminder)
Let’s have a look at our data structure:
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
The print() method gives an abbreviated printout.
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
It is useful to get some overview of the variables before getting started.
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
We will want to look at trends over time by continent. How many countries are in this data set in each continent? There are 12 years for each country. Are the data complete? table() gives an answer.
table(gapminder$continent, gapminder$year)
##
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## Africa 52 52 52 52 52 52 52 52 52 52 52 52
## Americas 25 25 25 25 25 25 25 25 25 25 25 25
## Asia 33 33 33 33 33 33 33 33 33 33 33 33
## Europe 30 30 30 30 30 30 30 30 30 30 30 30
## Oceania 2 2 2 2 2 2 2 2 2 2 2 2
Note: we used the $ symbol with data$variable notation because table() doesn’t have a data= argument. Another way to do this is to use the with() function, which makes the variables in a data set available directly. The same table can be obtained using:
with(gapminder, {table(continent, year)})
## year
## continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## Africa 52 52 52 52 52 52 52 52 52 52 52 52
## Americas 25 25 25 25 25 25 25 25 25 25 25 25
## Asia 33 33 33 33 33 33 33 33 33 33 33 33
## Europe 30 30 30 30 30 30 30 30 30 30 30 30
## Oceania 2 2 2 2 2 2 2 2 2 2 2 2
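The table above repeats the same count in every year. To collapse it to one row per continent, one option is to count distinct countries with dplyr. This is only a sketch; the names countries_per_continent and n_countries are illustrative and do not appear elsewhere in this tutorial.

```r
library(dplyr)
library(gapminder)

# Count the distinct countries observed in each continent
countries_per_continent <- gapminder %>%
  group_by(continent) %>%
  summarise(n_countries = n_distinct(country))

countries_per_continent
```

Because every country has all 12 years, these counts should match any single column of the table.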
1D plots: Bar plots for discrete variables
As we saw previously during the lecture, the distribution of a discrete variable is best visualized using a bar plot. Take continent, for example. With ggplot2, this is relatively easy:
- we start by mapping the x variable to continent
- then, we add a geom_bar() layer, which counts the observations in each category and plots them as bar lengths.
ggplot(gapminder, aes(x=continent)) + geom_bar()
To make this more colorful, you can also map the fill attribute to continent.
ggplot(gapminder, aes(x=continent, fill=continent)) + geom_bar()
With ggplot2 features, we will also be able to:
- change the default color schemes
- modify labels
- change the legend position, or eliminate it in some cases
- flip axes …
Let’s try some!
- We will change the y axis, count, in geom_bar() to ..count../12 in order to represent the number of countries (each country contributes 12 rows, one per year).
- Change the label of the y axis to a more meaningful one: Number of countries
- Suppress the default legend for continent, which is redundant in this case
ggplot(gapminder, aes(x=continent, fill=continent)) +
geom_bar(aes(y=..count../12)) +
labs(y="Number of countries") +
guides(fill=FALSE)
Note: Every plot in ggplot2 is a ggplot object.
If you want to save a given plot for a future use, store it in a variable by using: mybar <- ggplot() + ... or simply by mybar <- last_plot()
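A hedged aside for newer ggplot2 releases: as far as I know, the ..count.. notation and guides(fill=FALSE) used above are deprecated (since roughly ggplot2 3.4.0 and 3.3.4 respectively) in favor of after_stat() and guides(fill="none"). The sketch below assumes ggplot2 3.3.4 or later and stores the plot directly in a variable; the name countries_bar is illustrative only.

```r
library(ggplot2)
library(gapminder)

# Same bar plot as above, written with the newer after_stat() notation
# (assumes ggplot2 >= 3.3.4; countries_bar is a hypothetical name)
countries_bar <- ggplot(gapminder, aes(x = continent, fill = continent)) +
  geom_bar(aes(y = after_stat(count / 12))) +  # 12 rows (years) per country
  labs(y = "Number of countries") +
  guides(fill = "none")                        # drop the redundant legend

# The stored result is an ordinary ggplot object
inherits(countries_bar, "ggplot")
```

Since countries_bar is a ggplot object, further layers or coordinate changes can be added to it with + like any other stored plot.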
mybar <- last_plot()
Some other ggplot2 features
- Transforming coordinates using the coord_trans() function
mybar + coord_trans(y="sqrt")
- Flipping axes using the coord_flip() function
mybar + coord_flip()
- Transforming to polar coordinates
mybar + coord_polar()
1D plots: Density plots for continuous variables
The gapminder data set contains several continuous variables: life expectancy (lifeExp), population (pop) and gross domestic product per capita (gdpPercap) for each year and country. For such variables, density plots provide a useful graphical summary.
Let’s start by exploring life expectancy. The simplest plot uses this as the horizontal axis, aes(x=lifeExp) and then adds geom_density() to calculate and plot the smoothed frequency distribution.
ggplot(data=gapminder, aes(x=lifeExp)) +
geom_density()
Several options make this plot prettier: change the line thickness (size=), add a fill color (fill=), and make the fill color partially transparent (alpha=).
ggplot(data=gapminder, aes(x=lifeExp)) +
geom_density(size=1.5, fill="pink", alpha=0.3)
Differences by continent
The distribution of lifeExp is bimodal, and the reason is not obvious from this plot alone. We can add another aesthetic attribute, fill=continent, which geom_density() inherits, to see how the distribution differs across continents.
ggplot(data=gapminder, aes(x=lifeExp, fill=continent)) +
geom_density(alpha=0.3)
Note 1: We used transparent colors (alpha=) so that the overlapping distributions of the different continents remain visible.
Note 2: It is easy now to see that African countries differ markedly from the rest.
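A numeric summary can back up what the density plot suggests. The sketch below computes the median and interquartile range of life expectancy per continent, pooled over all years; the names lifeexp_by_continent, median_lifeExp and iqr_lifeExp are illustrative, not part of the original tutorial.

```r
library(dplyr)
library(gapminder)

# Median and spread of life expectancy per continent, lowest median first
lifeexp_by_continent <- gapminder %>%
  group_by(continent) %>%
  summarise(median_lifeExp = median(lifeExp),
            iqr_lifeExp = IQR(lifeExp)) %>%
  arrange(median_lifeExp)

lifeexp_by_continent
```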
Boxplots and other visual summaries
You might want to visualize the distributions of life expectancy by another visual summary, grouped by continent. All you need to do is change the aesthetic to show continent on one axis, and life expectancy (lifeExp) on the other.
gap1 <- ggplot(data=gapminder, aes(x=continent, y=lifeExp, fill=continent))
Then, add a geom_boxplot() layer:
gap1 +
geom_boxplot(outlier.size=2)
Challenge 1
- Remove the legend from this plot
- Make the plot horizontal
- Instead of a boxplot, try geom_violin()
Effect ordering
The continents are a factor, and are ordered alphabetically by default. It might be more useful to order them by the mean or median life expectancy.
In this example, I use the dplyr “pipe” notation (%>%) to send the gapminder data to the dplyr::mutate() function and, within that, reorder() the continents by their median life expectancy.
gapminder %>%
mutate(continent = reorder(continent, lifeExp, FUN=median))
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Note: In other situations, you could use FUN=mean, FUN=sd, or FUN=max to sort the levels by their means, standard deviations, maximums, or any other function.
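The printout above looks the same as the original data because reorder() changes only the order of the factor levels, not the values in the column. A small sketch that makes the change visible by comparing the levels before and after (reordered is an illustrative name):

```r
library(dplyr)
library(gapminder)

# Default level order is alphabetical
levels(gapminder$continent)

# Reorder the levels by median life expectancy (ascending)
reordered <- gapminder %>%
  mutate(continent = reorder(continent, lifeExp, FUN = median))

# New level order, and the medians that determine it
levels(reordered$continent)
tapply(reordered$lifeExp, reordered$continent, median)
```

Note that mutate() returns a new data frame, so gapminder itself keeps its alphabetical level order unless the result is assigned back.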
We can now pipe the result of this right into ggplot:
gapminder %>%
mutate(continent = reorder(continent, lifeExp, FUN=median)) %>%
ggplot(aes(x=continent, y=lifeExp, fill=continent)) +
geom_boxplot(outlier.size=2)
Exploring GDP
Let’s look at the distribution of gdpPercap in a similar way, starting with the unconditional distribution.
ggplot(data=gapminder, aes(x=gdpPercap)) +
geom_density()
Challenge 2
- As we did for lifeExp, plot the distributions separately for each continent.
- It is probably more useful to plot GDP on a log scale. Add another layer that transforms the x axis to log10(gdpPercap).
- Make boxplots of gdpPercap by continent.
- Do the same, but plot GDP on a log scale.
1.5D: Layers & Time series plots
Layers
Let’s explore how life expectancy changes with GDP per capita for a single country, for example China. We can use geom_line() to make a line plot.
china <- ggplot(subset(gapminder, country =="China"), #subsetting data
aes(x=gdpPercap, y=lifeExp))
china + geom_line()
We can use both geom_line() and geom_point() to make a line plot with points at the data values.
china + geom_line() + geom_point() #adding points to data values
Note: This brings up another important concept with ggplot2: layers. A given plot can have multiple layers of geometric objects, plotted one on top of the other.
If we make the lines and points different colors, we can see that points are placed on top of the lines, since they are in the second layer.
china + geom_line(color="lightblue") + geom_point(color="violetred") #adding some colors
If we switch the order of geom_point() and geom_line(), we’ll reverse the layers.
china + geom_point(color="violetred") + geom_line(color="lightblue")
Note: aesthetics that are included in the call to ggplot() (or added completely separately) become the defaults for all layers, but we can also control the aesthetics of each layer separately. For example, we could color the points by year:
china + geom_line() + geom_point(aes(color=year)) #color the points by year
With a rainbow:
china + geom_line() + geom_point(aes(color=year)) + scale_color_gradientn(colours = rainbow(5)) #with a rainbow
Coloring both points and lines:
china + geom_line() + geom_point() + aes(color=year) #coloring both points and lines
china + geom_line() + geom_point() + aes(color=year) + scale_color_gradientn(colours = rainbow(5)) #both with rainbow shades
Challenge 3
- Make a plot of lifeExp vs gdpPercap for China and India, with both lines and points.
Time series plot
Let’s explore how life expectancy has changed over time. The simplest way is to plot a line for each country over the years. To do this, we use the group aesthetic.
ggplot(gapminder, aes(x=year, y=lifeExp, group=country)) + #using the aesthetic group
geom_line()
Adding colors:
ggplot(gapminder, aes(x=year, y=lifeExp, group=country , color = continent)) + #adding color
geom_line()
Making the lines semi-transparent:
ggplot(gapminder, aes(x=year, y=lifeExp, group=country , color = continent)) + #making the lines semi-transparent
geom_line(alpha = 0.5)
Plotting a summary
A better way to look at trends over time is to compute the mean or median for each year and continent and plot those.
gapminder %>%
group_by(continent, year) %>%
summarise(lifeExp=median(lifeExp)) %>% head() #median for each year and continent
## # A tibble: 6 x 3
## # Groups: continent [1]
## continent year lifeExp
## <fct> <int> <dbl>
## 1 Africa 1952 38.8
## 2 Africa 1957 40.6
## 3 Africa 1962 42.6
## 4 Africa 1967 44.7
## 5 Africa 1972 47.0
## 6 Africa 1977 49.3
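The "# Groups: continent [1]" line in the printout indicates that the summary is still grouped by continent. Assuming dplyr 1.0.0 or later is installed, the .groups argument of summarise() makes this choice explicit. The sketch below returns an ungrouped result; gap_median is an illustrative name.

```r
library(dplyr)
library(gapminder)

# Median life expectancy per continent and year, returned without grouping
# (.groups requires dplyr >= 1.0.0; gap_median is a hypothetical name)
gap_median <- gapminder %>%
  group_by(continent, year) %>%
  summarise(lifeExp = median(lifeExp), .groups = "drop")

# Check whether the result still carries grouping information
is_grouped_df(gap_median)
```

Plots do not depend on this grouping, but dropping it can avoid surprises if further dplyr verbs are applied to the summary.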
One nice feature of the dplyr and tidyverse framework is that you can pipe the result of such a summary directly to ggplot():
gapminder %>% #piping to ggplot
group_by(continent, year) %>%
summarise(lifeExp=median(lifeExp)) %>%
ggplot(aes(x=year, y=lifeExp, color=continent)) +
geom_line(size=1) +
geom_point(size=1.5)
If you want to make several plots of such a summarized data set, save the result in a new object.
gapminder %>% #saving in a new data set using assignment
group_by(continent, year) %>%
summarise(lifeExp=median(lifeExp)) -> gapyear
Let’s play with our plot and make it fancier!
We can fit linear regression lines for each continent instead of joining all the points:
ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) + #fitting linear regression lines for each continent
geom_point(size=1.5) +
geom_smooth(aes(fill=continent), method="lm")
We can also use a loess smooth rather than a linear regression:
ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) + #using a loess smooth
geom_point(size=1.5) +
geom_smooth(aes(fill=continent), method="loess")
We can change the default placement of the legend by putting it inside the plot:
ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) + #using a loess smooth
geom_point(size=1.5) +
theme(
legend.position = c(0.99, 0.03),
legend.justification = c("right", "bottom") #placing the legend inside the plot
)+
geom_smooth(aes(fill=continent), method="loess")
2D: Scatterplots
Let’s explore the relationship between life expectancy and GDP with a scatterplot.
A basic scatterplot is set up by assigning two variables to the x and y aesthetic attributes; we can then add the points in another layer.
plt <- ggplot(data=gapminder,
aes(x=gdpPercap, y=lifeExp))
plt + geom_point()
Or, color them by continent.
plt + geom_point(aes(color=continent)) #adding color by continent
For a better look, we can also add a smoothed curve for all the data:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") #adding a smoothed curve for all the data
As we saw earlier, GDP is better plotted on a log scale:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10() #plotting on a log scale
Customizing the plot
The last plot, on the log scale, has ugly axis labels; let’s adjust the scale:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10(labels=scales::comma) #adjusting scale
Moving the legend inside the plot:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10(labels=scales::comma) +
theme(legend.position = c(0.8, 0.2)) # putting the legend inside the plot
Changing the theme:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10(labels=scales::comma) +
theme_bw() #changing the theme of the plot
Replacing the single loess smoothed curve with a separate regression line for each continent:
plt + geom_point(aes(color=continent)) +
geom_smooth(aes(color=continent), method="lm") + #mapping color in the smoother gives one regression line per continent
scale_x_log10(labels=scales::comma) +
theme_bw()
Making a “bubble” plot, mapping the size of each point to population (pop):
plt + geom_point(aes(size = pop)) + #making a bubble plot by mapping the size of each point to population
geom_smooth(method="lm") +
scale_x_log10(labels=scales::comma) +
theme_bw()
Making the points semi-transparent:
plt + geom_point(aes(size = pop), alpha = 0.5) + #making the points semi-transparent
geom_smooth(method="lm") +
scale_x_log10(labels=scales::comma) +
theme_bw()
Let’s explore life expectancy by continent for a given year. To do that, we will need to filter our data.
gm_2007 <- subset(gapminder, year==2007) #filtering data by picking those of 2007
ggplot(gm_2007, aes(y=lifeExp, x=continent)) + geom_point()
Many points overlap within each continent, so we can jitter them horizontally:
ggplot(gm_2007, aes(y=lifeExp, x=continent)) +
geom_point(position=position_jitter(width=0.1, height=0)) #jittering points horizontally to reduce overplotting
Advanced customized and fancy plot
Bubble plot
Let’s explore GDP versus life expectancy in 2007, labelling the largest countries.
ggplot(gm_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop), # add scatter points
             alpha = 0.5) +
  geom_text(aes(x = gdpPercap, y = lifeExp + 3, label = country), # add text annotations for the very large countries
            color = "grey50",
            data = filter(gm_2007, pop > 1000000000 | country %in% c("Nigeria", "United States"))) +
  scale_x_log10(limits = c(200, 60000)) + # log-scale the x axis and set its limits
  labs(title = "GDP versus life expectancy in 2007", # change labels
       x = "GDP per capita (log scale)",
       y = "Life expectancy",
       size = "Population",
       color = "Continent") +
  scale_size(range = c(0.1, 10), # change the size scale
             guide = "none") + # remove size legend
  theme_classic() + # add a nicer theme
  theme(legend.position = "top", # place legend at top and grey axis lines
        axis.line = element_line(color = "grey85"),
        axis.ticks = element_line(color = "grey85"))
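To check which countries the geom_text() layer will label, the same filter can be run on its own before plotting. A small sketch, where labelled is an illustrative name and gm_2007 is rebuilt so the snippet runs by itself:

```r
library(dplyr)
library(gapminder)

# Rebuild the 2007 subset used in the plot
gm_2007 <- subset(gapminder, year == 2007)

# Same condition as the data= argument of geom_text()
labelled <- filter(gm_2007,
                   pop > 1000000000 | country %in% c("Nigeria", "United States"))

labelled %>% select(country, continent, pop)
```

Adjusting the population threshold or the list of country names here is a quick way to preview a different set of labels.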